Introduction

Welcome! The purpose of this project is to produce an algorithm that helps the online real-estate website Zillow.com predict home sale prices with greater accuracy. Improving real-estate valuations is important for several reasons. Namely, it:

  1. Improves user experience
  2. Provides context for anticipated property taxes and homeowner’s insurance
  3. Alleviates historic and systemic biases commonplace with home valuations in neighborhoods of color
  4. Supports informed decision-making among families regarding long-term investing, since buying a home is an investment

The gains from such an algorithm are worthwhile; however, this project proved to be a challenging exercise. Namely, it is incredibly difficult to quantify the features that influence home values without leaning into historical disparities (homes in ‘good’ versus ‘bad’ areas often track disparities in race, class, and amenities). Likewise, the presence of collinear variables may affect the performance of our model.

We employed the hedonic model (HM) for this project. Simply put, HM refers to a regression model that estimates the influence various factors have on the price of a good (i.e., a home) and is commonly used in real-estate pricing. In this case, home price served as our dependent variable, while various features served as our independent variables. We identified a collection of both physical and demographic features we believed influence the price of a home, cleaned these datasets, and tested their significance for home valuation. Furthermore, we checked our results through cross-validation testing and employed statistical metrics such as mean absolute error (MAE), mean absolute percentage error (MAPE), and Moran’s I.
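
To make the hedonic setup concrete, here is a toy sketch (in Python, with entirely invented coefficients and data) of the ordinary-least-squares estimation that underlies such a model:

```python
import numpy as np

# Toy hedonic regression: price as a linear function of two features.
# The "true" coefficients (50000, 120, 15000) are invented for illustration.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 3500, n)         # heated square footage
baths = rng.integers(1, 5, n)            # total bathrooms
price = 50_000 + 120 * sqft + 15_000 * baths + rng.normal(0, 20_000, n)

# OLS: solve for beta in price ~ intercept + sqft + baths (the fit behind lm()).
X = np.column_stack([np.ones(n), sqft, baths])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(beta)  # estimates close to the true (50000, 120, 15000)
```

Each estimated coefficient reads as the marginal dollar contribution of that feature to sale price, holding the others constant.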

Ultimately, we produced a functional model, though with a major weakness: it is not generalizable across majority-white versus majority-non-white neighborhoods. Specifically, we saw a higher MAPE within majority-non-white neighborhoods, as well as lower average home valuations. We review the reasons for this later.

Nonetheless, this project was a great exercise. We look forward to you reviewing our model and code!

Click “code” on the right-hand side to view what takes place behind the scenes.

Data Wrangling

To gather data, our team focused on Mecklenburg County, NC (Charlotte Metro Area) and sourced information from the county’s open data website, as well as the American Community Survey and the U.S. Census.

Charlotte Home Sales Data

To begin, we will import a home sales dataset that includes variables like location, housing characteristics, and home quality for the Charlotte Metro Area. We will then ‘clean’ the data by creating useful columns such as “building grade” as a numeric value (where higher values correspond to greater quality), “age of home (age)”, and “price per square foot (pricesqft)”, and by calculating the number of “total baths (totbaths)” from the full- and half-bathroom counts. Moving forward, we’ll refer to this home sales data as “internal variables.”

#Import sales data, remove columns we won't need, add useful columns, set CRS for North Carolina
library(dplyr)
library(sf)
CLT_internal <- 
  st_read("https://github.com/mafichman/MUSA_508_Lab/raw/main/Midterm/data/2022/studentData.geojson") %>%
  st_transform('ESRI:103500')

CLT_internal <- CLT_internal[c(5,9,20,21,26,28,30:46,57:60,67,68,70,71,72)] %>%
  mutate(
    age = 2022 - yearbuilt,
    sqft = totalac * 43560,
    pricesqft = price / (totalac * 43560), # price per square foot (Inf where lot acreage is 0)
    totbaths = (halfbaths * 0.5) + fullbaths)

# Recode building grade into a 1-7 numeric quality scale
CLT_internal$quality <- recode(CLT_internal$bldggrade,
                               MINIMUM = 1, FAIR = 2, AVERAGE = 3, GOOD = 4,
                               `VERY GOOD` = 5, EXCELLENT = 6, CUSTOM = 7)

Adding Amenities Data

To build a strong algorithm (model), it’s important to include variables that relate to the housing market, such as local schools, grocery stores, and parks. We’ll refer to these variables as “amenities.”

# Adding school data 
CLT_schools <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Schools.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding grocery store data
CLT_grocery <- 
  st_read("Grocery_pts.geojson") %>%
  st_transform(st_crs(CLT_internal))

# Adding parks data 
CLT_parks <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Parks.geojson") %>%
  st_transform(st_crs(CLT_internal))

Adding Spatial Structure Data

Finally, we will add variables that provide demographic and environmental data for the Charlotte Metro Area. Specifically, we will include educational attainment, household income, and neighborhoods data. We’ll refer to these variables as “spatial structure.”

# Adding demographic data (requires the tidycensus package)
library(tidycensus)
CLT_demo <- 
  get_acs(geography = "tract", 
          variables = c("B19013_001E", "B15003_022E","B15003_001E"), 
          year=2020, state=37, county=119, 
          geometry=TRUE, output="wide") %>%
  st_transform(st_crs(CLT_internal)) %>%
  dplyr::select( -NAME, -B19013_001M, -B15003_022M, -B15003_001M)

CLT_demo <-
  CLT_demo %>%
  rename(HH_inc = B19013_001E, 
         College = B15003_022E,
         College_age_pop = B15003_001E) %>%
  mutate(college_perc = College/College_age_pop) %>%
  dplyr::select( -College, -College_age_pop)

# Adding neighborhood data 
CLT_neighborhoods <- 
  st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/School_districts.geojson") %>%
  st_transform(st_crs(CLT_internal))

Exploratory Data Analysis

Orienting Our Variables

So far, we have added internal, amenities, and spatial structure variables. However, in order to build our model and analyze how these variables relate to home sales, we must modify them. We’ll achieve this using two techniques:

1. K-nearest neighbor (KNN): this will find the distance between a given home and the nearest amenity of each type (school, grocery store, park).

2. Spatial join (SJ): this will join our spatial structure data (educational attainment, neighborhoods) to our internal variables (Charlotte home sales).
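
For intuition on technique 1, the distance feature reduces to “for each home, how far away is the nearest school/grocery store/park?”. Below is a minimal Python stand-in for the nearest-neighbor calculation (the coordinates are invented; the course’s actual nn_function is used in the R chunk that follows):

```python
import math

def nn_mean_distance(homes, amenities, k=1):
    """Mean Euclidean distance from each home to its k nearest amenities
    (a toy stand-in for the nn_function used in the R code)."""
    out = []
    for hx, hy in homes:
        dists = sorted(math.hypot(hx - ax, hy - ay) for ax, ay in amenities)
        out.append(sum(dists[:k]) / k)
    return out

homes = [(0, 0), (10, 0)]
parks = [(0, 3), (10, 4), (5, 0)]
print(nn_mean_distance(homes, parks, k=1))  # [3.0, 4.0]
```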

Note to instructor: the nn_function did not work as expected, perhaps because the geometries contain multiple points (rather than a single X/Y pair), so we took the centroid of each feature.

# Most near school, grocery store, and park 
CLT_internal <-
  CLT_internal %>% 
    mutate(
      school_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_schools)), k = 1),
      grocery_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_grocery)), k = 1),
      park_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_parks)), k = 1))

# Spatial join 
CLT_internal <- 
  st_join(CLT_internal,CLT_demo) 

CLT_internal <- 
  st_join(CLT_internal, CLT_neighborhoods)

Summary Statistics

Below are summary statistics tables for each variable category (internal, amenities, spatial structure).

#Internal variables 
ss_internal <- CLT_internal
ss_internal <- st_drop_geometry(ss_internal)
ss_internal <- ss_internal %>%
  dplyr::select("sqft","pricesqft", "totbaths", "yearbuilt") 
stargazer(as.data.frame(ss_internal), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## =======================================================
## Statistic   N      Mean    St. Dev.   Min      Max     
## -------------------------------------------------------
## sqft      46,183 77,176.9 4,648,606.0 0.0 425,034,086.0
## pricesqft 46,138  Inf.0               0.0     Inf.0    
## totbaths  46,183   2.6        0.9     0.0     11.0     
## yearbuilt 46,183 1,993.4     31.8      0      2,022    
## -------------------------------------------------------
#Amenities 
ss_amenities <- CLT_internal
ss_amenities <- st_drop_geometry(ss_amenities)
ss_amenities <- ss_amenities %>%
  dplyr::select("school_nn1", "grocery_nn1", "park_nn1") 
stargazer(as.data.frame(ss_amenities), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Amenity Variables
## ================================================
## Statistic     N     Mean   St. Dev. Min    Max  
## ------------------------------------------------
## school_nn1  46,183 3,266.8 1,390.4  8.2  8,369.4
## grocery_nn1 46,183 1,737.8 1,274.4  37.0 8,599.7
## park_nn1    46,183 1,239.8  767.7   21.0 5,026.3
## ------------------------------------------------
# Spatial Structure
ss_spatial <- CLT_internal
ss_spatial <- st_drop_geometry(ss_spatial)
ss_spatial <- ss_spatial %>%
  dplyr::select("HH_inc", "college_perc", "FID") 
stargazer(as.data.frame(ss_spatial), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables")
## 
## Descriptive Statistics for Charlotte Metro Area Homes Spatial Structure Variables
## ====================================================
## Statistic      N      Mean   St. Dev.  Min     Max  
## ----------------------------------------------------
## HH_inc       46,180 86,175.7 37,415.7 17,685 238,750
## college_perc 46,180   0.3      0.1     0.03    0.6  
## FID          46,183   26.0     10.6     0      43   
## ----------------------------------------------------

Correlation Matrix

Below is a matrix visualizing the correlation between our variables. We can see that home price maintains a positive correlation with the following variables (in order of strength):

  • heated square footage of the home
  • home quality
  • total bathrooms
  • household income
  • percentage of residents with college degrees
  • bedrooms
  • number of fireplaces in the home

We can use this matrix to inform our variable selection for our model.
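
Each cell of the matrix is a pairwise Pearson coefficient, which can be computed directly; here is a short Python sketch with invented numbers:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

sqft = [1200, 1600, 2100, 2800]              # invented sample
price = [200_000, 260_000, 330_000, 420_000]
print(round(pearson(sqft, price), 3))        # close to +1: strong positive correlation
```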

numericVars <- select_if(st_drop_geometry(CLT_internal), is.numeric) %>% na.omit()

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("deepskyblue", "grey100", "firebrick1"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation Matrix of Numeric Variables", tl.cex = 0.5, tl.col = "black") +
  theme(axis.text.x = element_text(size = 10)) +
  plotTheme()

Scatterplots

Below are 4 home price correlation scatterplots based upon the results of our correlation matrix:

st_drop_geometry(CLT_internal) %>% 
  dplyr::select(price, quality, heatedarea, HH_inc, yearbuilt) %>% 
    filter(price < 10000000) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "hotpink") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of Internal and Spatial Variables") +
  theme(axis.text.x = element_text(size=7)) +
  plotTheme()

Maps

Below are 4 maps including:

1. A map of our dependent variable (price)

2. A map of park locations

3. A map of nearby grocery stores

4. A map of school quality

Note: the first 3 maps are in relation to home prices within the Charlotte Metro Area.
grid.arrange(ncol=2,
ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  labs(title="Home Price", subtitle="Charlotte Metro Area") +
  labs(color = "Observed Sales Price (quintiles)") +
   theme(plot.title=element_text(size=12, face='bold'),
         legend.key.size = unit(1.5, 'cm'), #change legend key size
        legend.key.height = unit(1, 'cm'), #change legend key height
        legend.key.width = unit(1, 'cm'), #change legend key width
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=10)) + #change legend text font size
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey70") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_parks, color="darkgreen") +
  labs(title="Park Locations vs Home Price", subtitle="Charlotte Metro Area") + 
  theme(plot.title=element_text(size=12, face='bold'),
        legend.key.size = unit(1.5, 'cm'), 
        legend.key.height = unit(1, 'cm'),
        legend.key.width = unit(1, 'cm'), 
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey30") +
  geom_sf(data = CLT_internal, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(CLT_internal,"price"),
                   name="Quintile\nBreaks") +
  geom_sf(data = CLT_grocery, color="deepskyblue") +
  labs(title="Grocery Store Locations vs Home Price", subtitle="Charlotte Metro Area") + 
  theme(plot.title=element_text(size=12, face='bold'),
    legend.key.size = unit(1.5, 'cm'), 
        legend.key.height = unit(1, 'cm'),
        legend.key.width = unit(1, 'cm'), 
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_schools, aes(fill=factor(Quality), color=factor(Quality))) +
  scale_fill_brewer(palette="RdYlGn") +
  scale_color_brewer(palette="RdYlGn") +
  labs(title="School Quality", subtitle="Niche.com ratings; Charlotte Metro Area") +
   theme(plot.title=element_text(size=12, face='bold'),
     legend.key.size = unit(1.25, 'cm'), 
        legend.key.height = unit(1, 'cm'), 
        legend.key.width = unit(1, 'cm'),
        legend.title = element_text(size=14), 
        legend.text = element_text(size=10)) + 
  mapTheme())

Regression Model

Splitting the Data

Now that we have identified important variables, we will build our model by dividing the data into training and test sets, partitioned as a 60/40 train-test split. We’ll then run a regression on our training set (60%) and use the results to assess generalizability on our ‘test’ data.

However, before dividing the dataset, we will create categorical variables for the number of floors, bathrooms, and bedrooms of a given home.
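
The partition step itself is simple; here is a simplified Python sketch of a seeded 60/40 split (unlike caret’s createDataPartition, this version does not stratify by the categorical variables):

```python
import random

def train_test_split(rows, train_frac=0.6, seed=42):
    """Shuffle indices with a fixed seed, then cut the data ~60/40."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * train_frac)
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))  # 60 40
```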

#Re-engineering data as categorical: number of floors
CLT_internal<- 
  CLT_internal%>%
  mutate(NUM_FLOORS.cat = ifelse(storyheigh == "1 STORY" | storyheigh == "1.5 STORY" | storyheigh == "SPLIT LEVEL" | storyheigh == "2.0 STORY", "Up to 2 Floors",
               ifelse(storyheigh == "2.5 STORY" | storyheigh == "3.0 STORY", "Up to 3 Floors", "4+ Floors")))
#Re-engineer bedroom as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BEDS.cat = ifelse(bedrooms <= 2, "Up to 2 Bedrooms",
                               ifelse(bedrooms == 3 | bedrooms == 4, "Up to 4 Bedrooms", "5+ Bedrooms")))
#Re-engineer bathroom data as categorical
CLT_internal <- 
  CLT_internal %>%
  mutate(NUM_BATHS.cat = ifelse(totbaths <= 2.5, "Up to 2.5 Bathrooms",
                               ifelse(totbaths <= 4.5, "Up to 4 Bathrooms", "5+ Bathrooms")))
#Creating training data
inTrain <- createDataPartition(
              y = paste(CLT_internal$NUM_FLOORS.cat, CLT_internal$NUM_BEDS.cat), 
              p = .60, list = FALSE)
charlotte.training <- CLT_internal[inTrain,] 
charlotte.test <- CLT_internal[-inTrain,]  

reg.training <- lm(price ~ ., data = st_drop_geometry(charlotte.training) %>% 
                                    dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc))
summary(reg.training)
## 
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1649590   -83549    -9315    59577 43774477 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -279838.2366   45623.6383  -6.134
## heatedarea                           131.1079       5.2000  25.213
## quality                           191626.3147    5818.6647  32.933
## NUM_FLOORS.catUp to 2 Floors       52533.5680   24550.3967   2.140
## NUM_FLOORS.catUp to 3 Floors       34625.6208   31696.3866   1.092
## NUM_BEDS.catUp to 2 Bedrooms       42387.2026   17687.7206   2.396
## NUM_BEDS.catUp to 4 Bedrooms       34603.3971   11666.4011   2.966
## NUM_BATHS.catUp to 2.5 Bathrooms -405010.7599   28414.4206 -14.254
## NUM_BATHS.catUp to 4 Bathrooms   -411105.8102   25842.3868 -15.908
## park_nn1                              -4.7558       3.7905  -1.255
## grocery_nn1                          -23.2341       2.5575  -9.085
## age                                 2089.0282     138.4629  15.087
## HH_inc                                 0.8801       0.1318   6.679
## college_perc                       35598.9525   37417.5895   0.951
##                                              Pr(>|t|)    
## (Intercept)                           0.0000000008716 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                  0.03238 *  
## NUM_FLOORS.catUp to 3 Floors                  0.27466    
## NUM_BEDS.catUp to 2 Bedrooms                  0.01656 *  
## NUM_BEDS.catUp to 4 Bedrooms                  0.00302 ** 
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                      0.20961    
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                                0.0000000000246 ***
## college_perc                                  0.34141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 452000 on 25557 degrees of freedom
##   (2143 observations deleted due to missingness)
## Multiple R-squared:  0.2512, Adjusted R-squared:  0.2509 
## F-statistic: 659.6 on 13 and 25557 DF,  p-value: < 0.00000000000000022

Evaluating Generalizability #1

To test the strength (accuracy) of our model’s price predictions, we will:

  1. Find the mean absolute error (MAE) + mean absolute percentage error (MAPE)
  2. Conduct cross-validation tests
  3. Plot predicted prices as a function of observed prices
  4. Map our test set residuals, including a Moran’s I test and a plot of the spatial lag in errors
  5. Map mean absolute percentage error (MAPE) by neighborhood
  6. Create a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood

MAE & MAPE
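
Both metrics reduce to one line each; an illustrative Python sketch with made-up prices (note that, matching the R code in this section, the percentage error is taken relative to the predicted price, though dividing by the observed price is the more common convention):

```python
def mae(pred, obs):
    """Mean absolute error: average dollar gap between predicted and observed prices."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)

def mape(pred, obs):
    """Mean absolute percentage error, taken relative to the predicted price."""
    return sum(abs(p - o) / p for p, o in zip(pred, obs)) / len(pred)

obs = [300_000, 450_000, 600_000]    # invented sale prices
pred = [330_000, 405_000, 600_000]   # invented predictions
print(mae(pred, obs))                # 25000.0
print(round(mape(pred, obs), 4))     # 0.0673
```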

#Creating predictions and calculating Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE)
charlotte.test <-
  charlotte.test %>%
  mutate(price.Predict = predict(reg.training, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)%>%
  filter(price < 5000000)

MAE <- mean(charlotte.test$price.AbsError, na.rm = T)
MAPE <- mean(charlotte.test$price.APE, na.rm = T)

reg.MAE.MAPE <- 
  cbind(MAE, MAPE) %>%
  kable(caption = "Regression MAE & MAPE") %>%
  kable_styling("hover",full_width = F) 

reg.MAE.MAPE
Regression MAE & MAPE

MAE       MAPE
108449    0.0927701

Cross Validation

#Cross-validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(CLT_internal) %>% 
                                dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, grocery_nn1,
                                               age, HH_inc, college_perc, 
                                               park_nn1), 
     method = "lm", trControl = fitControl, na.action = na.pass)
# Table
stargazer(as.data.frame(reg.cv$resample), type="text", digits=1, title="Cross Validation Results")
## 
## Cross Validation Results
## =======================================================
## Statistic  N    Mean    St. Dev.     Min        Max    
## -------------------------------------------------------
## RMSE      100 268,328.6 296,854.0 133,413.4 2,137,604.0
## Rsquared  100    0.6       0.2      0.01        0.8    
## MAE       100 114,163.2 17,868.1  93,420.4   207,393.9 
## -------------------------------------------------------
# Plot 
ggplot(reg.cv$resample, aes(x=MAE)) +
  geom_histogram(fill = "darkgreen") +
  labs(title = "Count of Mean Absolute Error During Cross-Validation") +
  xlab("MAE")+
  ylab("Count")+
  plotTheme()

Visualizing Prediction Errors

#Visualizing prediction errors
charlotte.APE <- charlotte.test[c(6,36,43:46,51,54)]

charlotte_APE.sf <- 
  charlotte.APE %>%
  filter(price.APE > 0) %>%
  st_as_sf(sf_column_name=geometry) %>%
  st_transform('ESRI:103500')

grid.arrange(ncol=2,
             ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = q5(price.Predict)), size = .25) +
  scale_colour_manual(values = palette5,
                   labels=qBr(charlotte.APE,"price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Predicted Sales Price") +
  mapTheme(),

ggplot() +
  geom_sf(data = CLT_neighborhoods, fill = "grey40") +
  geom_sf(data = charlotte_APE.sf, aes(color = price.APE), size = .25) +
  labs(title="Predicted Sales Price\nAbsolute Percent Error") +
  binned_scale(aesthetics = "color",
               scale_name = "stepsn",
               palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
               breaks = c(0.10, 0.20, 0.5, 0.75),
               limits = c(0, 50),
               show.limits = TRUE,
               guide = "colorsteps"
  ) +
  mapTheme())

#Predicted vs observed sales price
ggplot(
charlotte_APE.sf, aes(price, price.Predict, col = price.APE)) +
binned_scale(aesthetics = "color",
    scale_name = "stepsn", 
    palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
    breaks = c(0.10, 0.20, 0.5, 0.75),
    limits = c(0, 50),
    show.limits = TRUE, 
    guide = "colorsteps") +
geom_point(size=1) +
scale_y_continuous(limits = c(0, 4000000)) +
scale_x_continuous(limits = c(0, 4000000)) +
labs(title="Sales Price vs. Predicted", subtitle="Charlotte Metro Area") +
ylab("Predicted Sales Price (in dollars)") +
xlab("Observed Sales Price (in dollars)") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.5) +
labs(color = "Absolute % Error") +
geom_label(
  label="0% error line", 
  x=3500000,
  y=3000000,
  label.padding = unit(0.25, "lines"), # Rectangle size around label
  label.size = 0.15,
  color = "black",
  fill="grey80")

Moran’s I

By calculating Moran’s I for this dataset, we are determining whether there is spatial autocorrelation among the prediction errors for home sales in Charlotte. As seen in the outcome, Moran’s I is high, meaning there is a spatial relationship in the prediction errors that must be accounted for.
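
For intuition, Moran’s I compares each observation’s deviation from the mean with the deviations of its spatial neighbors. A from-scratch Python sketch on a toy row-standardized weights matrix (this illustrates only the statistic, not moran.mc’s permutation test):

```python
def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z are deviations from the mean and S0 is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    num = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    s0 = sum(sum(row) for row in weights)
    return (n / s0) * num / den

# Four points on a line; each neighbors the adjacent point(s), rows standardized.
W = [[0, 1, 0, 0],
     [0.5, 0, 0.5, 0],
     [0, 0.5, 0, 0.5],
     [0, 0, 1, 0]]
clustered = [10, 12, 30, 33]               # similar values sit next to each other
print(round(morans_i(clustered, W), 3))    # positive: spatial clustering
```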

#Calculating Moran's I
coords.test <-  st_coordinates(charlotte.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
charlotte.test <- charlotte.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE))
moranTest <- moran.mc(charlotte.test$price.Error, 
                      spatialWeights.test, nsim = 999, na.action = na.exclude, zero.policy = TRUE)

# Observed and permuted Moran's I 
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()
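For context, `moran.mc`'s pseudo p-value is simply the rank of the observed statistic within the permuted distribution. A minimal, self-contained sketch with stand-in numbers (not our actual results; `sims` plays the role of `moranTest$res` and `obs` of `moranTest$statistic`):

```r
# How the permutation test's pseudo p-value arises (stand-in values):
sims <- c(-0.02, 0.01, 0.00, -0.01, 0.03)  # permuted Moran's I values
obs  <- 0.25                               # observed statistic
p_pseudo <- (sum(sims >= obs) + 1) / (length(sims) + 1)
p_pseudo  # 1/6 here; a small value suggests errors are spatially clustered
```

A small pseudo p-value means the observed Moran's I sits far in the tail of the permuted values, i.e. our prediction errors cluster in space more than chance would allow.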

Accounting for Neighborhood

We introduce neighborhoods into the model in an attempt to account for spatial bias in our predictions. Specifically, we include Census Block Groups, as this information was readily available. The addition of this variable increases the R-squared of our model, meaning it explains more of the observed variance with neighborhoods included than without them.
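The intuition can be seen in a toy illustration (simulated data, not the Charlotte dataset): when prices cluster spatially, a neighborhood fixed effect absorbs between-neighborhood variance and raises in-sample R-squared.

```r
# Simulated sketch: neighborhood fixed effects and R-squared
set.seed(1)
nhood <- factor(sample(c("A", "B", "C"), 300, replace = TRUE))
sqft  <- runif(300, 800, 3000)
price <- 100 * sqft + c(0, 50000, 120000)[as.integer(nhood)] + rnorm(300, sd = 20000)

r2_base  <- summary(lm(price ~ sqft))$r.squared
r2_nhood <- summary(lm(price ~ sqft + nhood))$r.squared
r2_nhood > r2_base  # the fixed effect soaks up the neighborhood-level shifts
```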

#Adjusting for neighborhood
reg.nhood <- lm(price ~ ., data = as.data.frame(charlotte.training) %>% 
                                 dplyr::select(price, heatedarea, 
                                               quality, NUM_FLOORS.cat,
                                               NUM_BEDS.cat, NUM_BATHS.cat, 
                                               park_nn1, grocery_nn1,
                                               age, HH_inc, college_perc))
summary(reg.nhood)
## 
## Call:
## lm(formula = price ~ ., data = as.data.frame(charlotte.training) %>% 
##     dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat, 
##         NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age, 
##         HH_inc, college_perc))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1649590   -83549    -9315    59577 43774477 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                      -279838.2366   45623.6383  -6.134
## heatedarea                           131.1079       5.2000  25.213
## quality                           191626.3147    5818.6647  32.933
## NUM_FLOORS.catUp to 2 Floors       52533.5680   24550.3967   2.140
## NUM_FLOORS.catUp to 3 Floors       34625.6208   31696.3866   1.092
## NUM_BEDS.catUp to 2 Bedrooms       42387.2026   17687.7206   2.396
## NUM_BEDS.catUp to 4 Bedrooms       34603.3971   11666.4011   2.966
## NUM_BATHS.catUp to 2.5 Bathrooms -405010.7599   28414.4206 -14.254
## NUM_BATHS.catUp to 4 Bathrooms   -411105.8102   25842.3868 -15.908
## park_nn1                              -4.7558       3.7905  -1.255
## grocery_nn1                          -23.2341       2.5575  -9.085
## age                                 2089.0282     138.4629  15.087
## HH_inc                                 0.8801       0.1318   6.679
## college_perc                       35598.9525   37417.5895   0.951
##                                              Pr(>|t|)    
## (Intercept)                           0.0000000008716 ***
## heatedarea                       < 0.0000000000000002 ***
## quality                          < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors                  0.03238 *  
## NUM_FLOORS.catUp to 3 Floors                  0.27466    
## NUM_BEDS.catUp to 2 Bedrooms                  0.01656 *  
## NUM_BEDS.catUp to 4 Bedrooms                  0.00302 ** 
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms   < 0.0000000000000002 ***
## park_nn1                                      0.20961    
## grocery_nn1                      < 0.0000000000000002 ***
## age                              < 0.0000000000000002 ***
## HH_inc                                0.0000000000246 ***
## college_perc                                  0.34141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 452000 on 25557 degrees of freedom
##   (2143 observations deleted due to missingness)
## Multiple R-squared:  0.2512, Adjusted R-squared:  0.2509 
## F-statistic: 659.6 on 13 and 25557 DF,  p-value: < 0.00000000000000022
charlotte.test.nhood <-
  charlotte.test %>%
  mutate(Regression = "Neighborhood Effects",
         price.Predict = predict(reg.nhood, charlotte.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = abs(price.Predict - price) / price) %>%
  filter(price < 5000000)

charlotte.test <- charlotte.test %>%
  mutate(Regression = "Baseline")

sales_predictions.sf <- CLT_internal %>%
  mutate(price.Predict = predict(reg.nhood, CLT_internal)) %>%
  filter(toPredict == "CHALLENGE")

sales_predictions.df <- as.data.frame(st_drop_geometry(sales_predictions.sf))
# Keep the two submission columns by position (column 43 is the price.Predict added above)
sales_predictions.df <- sales_predictions.df[c(30, 43)]


write.csv(sales_predictions.df,"Quisqueyanes.csv", row.names = FALSE)
bothRegressions <- 
  rbind(
    dplyr::select(charlotte.test, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)),
    dplyr::select(charlotte.test.nhood, starts_with("price"), Regression, MIDD_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)))    

While controlling for neighborhood improved our model, the difference is small and hard to notice visually, as the plot below shows:

#Neighborhood effect results
bothRegressions %>%
  dplyr::select(price.Predict, price, Regression) %>%
    ggplot(aes(price, price.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(price, price.Predict), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  theme(axis.text.x = element_text(margin = margin(3, 3, 3, 3))) +
  plotTheme()
#Scatter plot of MAPE by neighborhood mean price
npa.mean.sf <- charlotte.test %>%
  drop_na(price.APE) %>%
  group_by(MIDD_NAME) %>%
    summarise(mean_APE = mean(price.APE))

npa.price.sf <- charlotte.test %>%
  drop_na(price.APE) %>%
  group_by(MIDD_NAME) %>%
    summarise(mean_Price = mean(price.Predict))

MAPE_by_NPA <- merge(st_drop_geometry(npa.mean.sf), npa.price.sf, by="MIDD_NAME")

grid.arrange(ncol=2,
ggplot(MAPE_by_NPA, aes(mean_Price, mean_APE))+
  geom_jitter(height=2, width=2)+
  ylim(-5,5)+
  geom_smooth(method = "lm", aes(mean_Price, mean_APE), se = FALSE, colour = "red") +
  labs(title = "MAPE by Neighborhood Mean Sales Price",
       x = "Mean Home Price", y = "MAPE") +
  plotTheme(),

ggplot(npa.mean.sf, aes(x = MIDD_NAME, y = mean_APE)) +
  geom_point(alpha = 0.5) +
  labs(title = "Mean Absolute % Error by Neighborhood", subtitle = "Charlotte Area Home Sales") +
  ylab("Mean Absolute % Error") +
  xlab("Neighborhood") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))) 

st_drop_geometry(charlotte.test) %>%
  group_by(MIDD_NAME) %>%
  summarize(MAPE = mean(price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(CLT_neighborhoods) %>%
    st_sf() %>%
    ggplot() + 
   geom_sf(aes(fill = MAPE), color=NA) +
      scale_fill_gradient(low = palette5[5], high = palette5[1],
                          name = "MAPE") +
  geom_sf(data = charlotte.test, colour = "black", show.legend = "point", size = .05) +
      labs(title = "Mean test set MAPE by Block Groups", subtitle="Charlotte Metro Area") +
      mapTheme()

TidyCensus: Evaluating Generalizability (Test 2)

We split the Charlotte Metro Area into two groups, “majority white” and “majority non-white,” to test our model’s generalizability. White non-Hispanic residents currently represent 45.3% of Mecklenburg County’s population, so it is worthwhile to check whether the accuracy of our predictions depends on demographic context. (Source)

For this test, a “majority white” census block group is one where more than 50% of the total population identifies as white.
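A minimal sketch of the classification rule applied below (toy percentages, not census data); note the strict `> 50` cutoff:

```r
# Classify block groups by percent-white threshold (illustrative values)
pct_white <- c(72.1, 48.0, 50.0, 13.5)
context <- ifelse(pct_white > 50, "Majority White", "Majority Non-White")
context
```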
# Adding race data (the Population and White count columns, selected by position)
Race <- 
  st_read("Population.geojson") %>%
  select(c(9, 14)) %>%
  st_transform(st_crs(CLT_internal))
## Reading layer `Census_Population_Block_Groups' from data source 
##   `/Users/kemirichards/Documents/GitHub/Zestimate-Project/Population.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 555 features and 42 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -81.05803 ymin: 35.00171 xmax: -80.5503 ymax: 35.5151
## Geodetic CRS:  WGS 84
# Drop block groups with zero population, then remove remaining NAs
Race <- filter(Race, `Population` != 0) %>%
  na.omit

# Calculate percentage of white population
Race <- Race %>%
  mutate(PctWhite = ((`White`/`Population`)*100))

# Creating majority white column
Race <- Race %>%
  mutate(Race = ifelse(PctWhite > 50, "Majority White", "Majority Non-White")) 

# Plot 
ggplot() + geom_sf(data = na.omit(Race), aes(fill = `Race`)) +
    scale_fill_manual(values = c("#FA7800", "honeydew3"), name = "Race Context") +
    labs(title = "Race in Mecklenburg County, NC", 
         subtitle = "Charlotte Metro Area") +
    mapTheme() + theme(legend.position = "bottom")

Now, let’s check whether our model’s errors and predictions were consistent across demographic contexts:

# MAPE by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("30%", "25%")
MAPE_race <- data.frame(Variable, Result)

# Table 
kable(MAPE_race, caption = "MAPE by Race") %>%
    kable_styling("striped",full_width = F) 
MAPE by Race
Variable Result
Majority Non-White 30%
Majority White 25%
# Mean price by race
Variable <- c("Majority Non-White", "Majority White")
Result <- c("$305,016", "$508,226.50")
PRICE_race <- data.frame(Variable, Result)

# Table
kable(PRICE_race, caption = "Mean Predicted Price by Race") %>% 
  kable_styling("striped",full_width = F) 
Mean Predicted Price by Race
Variable Result
Majority Non-White $305,016
Majority White $508,226.50

The tables above show that our model’s errors were inconsistent across demographic contexts. In particular, our model experienced a greater MAPE within Majority Non-White areas: a MAPE of 30% among Majority Non-White neighborhoods indicates that, on average, our predicted home price was 30% away from the actual value. This error rate was 5 percentage points lower among Majority White neighborhoods. Likewise, our model’s mean predicted price is higher among Majority White neighborhoods, indicating a bias against Majority Non-White areas. For this reason, our model is not generalizable.
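To make the MAPE figure concrete, a quick back-of-the-envelope calculation (using the mean predicted price reported above as a stand-in for a typical home value):

```r
# At a 30% MAPE, a typical Majority Non-White prediction misses by roughly:
mape_nonwhite <- 0.30
typical_price <- 305016          # mean predicted price from the table above
expected_miss <- mape_nonwhite * typical_price
expected_miss                    # on the order of $90,000 for an average home
```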

Conclusion

Results

Our variables are able to explain roughly 50% of the variation observed in Charlotte home prices, based upon our R-squared. Our mean absolute error (MAE) is 106,593.9, indicating that on average our model’s price predictions differ from actual home prices by about $106,593. This is fair, and may be driven by outliers (costly homes) in our dataset. However, when evaluated within a racial context, the model does not perform consistently: it seemingly undervalues homes in majority non-white neighborhoods (based upon average predicted price) and experiences a greater mean absolute percentage error (MAPE) there than in majority white areas. The reasons for this are not immediately obvious and may stem from bias within our variables, including the inequitable distribution of amenities across the Charlotte Metro Area (e.g. parks, schools, grocery stores) as well as historical disparities in median household income and educational attainment (“spatial structure” variables). These features all factor into the viability and price of a home sale.

Discussion

The inconsistency of our model’s performance within a racial context underscores the challenge of creating algorithms that are fair to all. Algorithms are not magic; they simply refer to historical data in an attempt to forecast the future. Unfortunately, real-life discrimination and disparity are embedded within this data, and an algorithm can easily exacerbate this bias if its primary objective is to increase ‘efficiency’ and greater care isn’t taken to account for these social ills (perhaps by adding weights, omitting discriminatory variables, enforcing demographic parity, etc.).

Ultimately, it would not be socially responsible or ethically sound to publish an algorithm with easily identifiable biases. So, this means we’ll need to postpone our Zillow pitch meeting for now :)

Future iterations of this model should include sales data, as it is a strong predictor of home valuation (i.e. the sale price of your neighbor’s home is likely very similar to the sale price of your home). For this project we were not allowed to use this information, which added a level of difficulty we tried to make up for by emphasizing other variables such as the number of floors, bedrooms, and bathrooms, housing quality, the quality of nearby schools, and a home’s proximity to nearby grocery stores. And of course, this model can be strengthened by accounting for biases in the datasets used in the model-building process.